feat(simd): Phase 3 scaffold — NEON tier flavors (baseline / dotprod / bf16)#176
Splits the aarch64 dispatch surface into three tier files mirroring
the x86 v3/v4/native split. Each file documents the silicon, the
runtime and compile-time detection paths, and stubs out the tier-
specific types with intrinsic maps for future implementation.
src/simd_neon_baseline.rs
-------------------------
Tier floor — ARMv8.0-A `+neon` only. Pi 3 (A53), Pi 4 (A72), anything
that doesn't have dotprod/fp16/bf16. Native 128-bit lanes only;
composed 512-bit wrappers TODO (currently routed through `scalar::*`
fallback in `simd.rs:1593`). Placeholder for the future migration of
`simd_neon.rs::aarch64_simd` (lines 463-1126) into this file.
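
A minimal sketch of what a composed 512-bit wrapper could look like once it stops routing through `scalar::*`. The names `I32x4Lane` and `I32x16` are hypothetical and only illustrate the `[neon_native; 4]` shape; the real wrappers may store `int32x4_t` directly rather than arrays. The non-aarch64 arm exists only so the sketch builds on any host.

```rust
/// One native 128-bit lane group (hypothetical name, not the crate's type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x4Lane(pub [i32; 4]);

impl I32x4Lane {
    /// Lane-wise wrapping add through NEON intrinsics.
    #[cfg(target_arch = "aarch64")]
    #[inline]
    pub fn add(self, rhs: Self) -> Self {
        use core::arch::aarch64::{vaddq_s32, vld1q_s32, vst1q_s32};
        let mut out = [0i32; 4];
        // SAFETY: every pointer covers exactly four i32 values. Assumes NEON
        // is in the target's default feature set (true for common std targets).
        unsafe {
            let sum = vaddq_s32(vld1q_s32(self.0.as_ptr()), vld1q_s32(rhs.0.as_ptr()));
            vst1q_s32(out.as_mut_ptr(), sum);
        }
        Self(out)
    }

    /// Portable stand-in with the same wrapping semantics.
    #[cfg(not(target_arch = "aarch64"))]
    #[inline]
    pub fn add(self, rhs: Self) -> Self {
        let mut out = [0i32; 4];
        for i in 0..4 {
            out[i] = self.0[i].wrapping_add(rhs.0[i]);
        }
        Self(out)
    }
}

/// 512-bit value composed of four 128-bit lane groups (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x16(pub [I32x4Lane; 4]);

impl I32x16 {
    pub fn from_array(a: [i32; 16]) -> Self {
        let mut lanes = [I32x4Lane([0; 4]); 4];
        for (i, lane) in lanes.iter_mut().enumerate() {
            lane.0.copy_from_slice(&a[i * 4..i * 4 + 4]);
        }
        Self(lanes)
    }

    pub fn to_array(self) -> [i32; 16] {
        let mut out = [0i32; 16];
        for (i, lane) in self.0.iter().enumerate() {
            out[i * 4..i * 4 + 4].copy_from_slice(&lane.0);
        }
        out
    }

    /// Applies the 128-bit add to each of the four lane groups.
    pub fn add(self, rhs: Self) -> Self {
        let mut lanes = self.0;
        for i in 0..4 {
            lanes[i] = self.0[i].add(rhs.0[i]);
        }
        Self(lanes)
    }
}
```

Composing four native lanes keeps the 512-bit API shape shared with the x86 tiers while each operation stays a plain 128-bit NEON instruction.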
src/simd_neon_dotprod.rs
------------------------
ARMv8.2-A `+dotprod,+fp16`. Pi 5 (BCM2712, A76), Cortex-A75 and later,
Apple A11+, Snapdragon 8 Gen 1+. dotprod functions already implemented
in `simd_neon.rs:191-237` (will migrate to this file); F16 stubs new.
F16 intrinsic map documents the `vfmaq_f16` family with the stable-
Rust asm-byte workaround (issue #112800) following the AMX precedent
in `src/simd_amx.rs`.
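
For readers unfamiliar with the dotprod tier, the following is a scalar reference for what one signed 8-bit dot-product step computes: each of the four `i32` accumulator lanes gains the sum of four adjacent `i8` products. The function name `sdot_ref` is hypothetical and this is not the code in `simd_neon.rs:191-237`; it is only a portable model that such an implementation could be checked against.

```rust
/// Scalar model of a 128-bit signed dot-product step (hypothetical helper).
/// Lane `l` of `acc` accumulates a[4l..4l+4] · b[4l..4l+4].
pub fn sdot_ref(acc: &mut [i32; 4], a: &[i8; 16], b: &[i8; 16]) {
    for lane in 0..4 {
        let mut sum = 0i32;
        for k in 0..4 {
            // Widen to i32 before multiplying so the product cannot overflow.
            sum += i32::from(a[lane * 4 + k]) * i32::from(b[lane * 4 + k]);
        }
        // The accumulator add wraps, as integer SIMD adds do.
        acc[lane] = acc[lane].wrapping_add(sum);
    }
}
```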
src/simd_neon_bf16.rs
---------------------
ARMv8.6-A `+bf16` (or ARMv8.4-A + optional `+bf16`). Apple M2/M3/M4,
Snapdragon X Elite, Cortex-A510+, Graviton 3/4, Grace, Ampere One.
Apple M1 explicitly NOT in this tier (M1 is v8.5-A). Stubs `BF16x8`
(`bfloat16x8_t`) and `BF16x16` (`[bfloat16x8_t; 2]`). Documents BFMMLA
as the prize intrinsic (2×2 outer product in one instruction, ~32
GFLOP/s/core on M2 in bf16-matmul-bound kernels) and the asm-byte
fallback for stable Rust (issue #117222, mirrors `simd_amx.rs`).
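
A scalar sketch of the arithmetic BFMMLA performs, as I understand the Arm definition: the first operand is a 2×4 BF16 matrix, the second is a 4×2 BF16 matrix stored as 2×4 (rows of the transpose), and the result accumulates into a 2×2 `f32` matrix. The helper names are hypothetical, and the hardware's intermediate rounding is not modelled, so this is a reference for shape and indexing rather than a bit-exact emulation.

```rust
/// Widen a BF16 bit pattern to f32 (BF16 is the upper 16 bits of an f32).
pub fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits(u32::from(bits) << 16)
}

/// Narrow an f32 to BF16 by truncation (no rounding; illustration only).
pub fn f32_to_bf16_truncate(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Scalar model of one BFMMLA step (hypothetical helper).
/// `a` holds 2 rows of 4 BF16 values; `b` holds 2 rows of 4 BF16 values that
/// act as the columns of the right-hand matrix; `acc` is row-major 2x2 f32.
pub fn bfmmla_ref(acc: &mut [f32; 4], a: &[u16; 8], b: &[u16; 8]) {
    for i in 0..2 {
        for j in 0..2 {
            let mut sum = 0.0f32;
            for k in 0..4 {
                sum += bf16_to_f32(a[i * 4 + k]) * bf16_to_f32(b[j * 4 + k]);
            }
            acc[i * 2 + j] += sum;
        }
    }
}
```

A reference like this can be run on any host, which may help validate byte-encoded asm once aarch64 hardware with `+bf16` is available.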
.cargo/config-{pi5,apple-m2,graviton}.toml
------------------------------------------
Three cargo configs matching the x86 v3/v4/native triplet shape:
- config-pi5.toml → -Ctarget-cpu=cortex-a76 +dotprod,+fp16
- config-apple-m2.toml → -Ctarget-cpu=apple-m2 +bf16,+dotprod,+fp16,+i8mm
- config-graviton.toml → -Ctarget-cpu=neoverse-v2 +bf16,+dotprod,+fp16,+i8mm
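
The flags in these configs surface inside the crate as compile-time `target_feature` cfgs. A small sketch of how code could observe them follows; the function name `compiled_in_features` is hypothetical. Building with `config-pi5.toml` would be expected to report `dotprod` and `fp16`, while a default x86 build reports nothing.

```rust
/// Lists the aarch64 tier features enabled at compile time (hypothetical helper).
pub fn compiled_in_features() -> Vec<&'static str> {
    let mut on = Vec::new();
    // Each cfg! is resolved by rustc from -Ctarget-cpu / -Ctarget-feature.
    if cfg!(target_feature = "neon") {
        on.push("neon");
    }
    if cfg!(target_feature = "dotprod") {
        on.push("dotprod");
    }
    if cfg!(target_feature = "fp16") {
        on.push("fp16");
    }
    if cfg!(target_feature = "bf16") {
        on.push("bf16");
    }
    if cfg!(target_feature = "i8mm") {
        on.push("i8mm");
    }
    on
}
```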
src/lib.rs
----------
Three new gated module declarations (`#[cfg(all(target_arch =
"aarch64", feature = "std"))]`). No dispatch changes in `simd.rs` —
this PR is scaffold + docs only. The dispatch wiring lands in a
follow-up once the F16/BF16 implementations exist to dispatch to.
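
For orientation, here is a sketch of how the follow-up dispatch wiring might pick a tier at runtime. The `Tier::Neon`, `Tier::NeonDotProd`, and `Tier::NeonBf16` names come from this PR; the `Scalar` variant, the `classify` helper, and the rule that bf16 is only selected when dotprod is also present are assumptions made for illustration, not the contents of `simd.rs::detect_tier()`.

```rust
/// Dispatch tiers, ordered from least to most capable.
/// `Scalar` is an assumed fallback for non-aarch64 hosts.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Scalar,
    Neon,
    NeonDotProd,
    NeonBf16,
}

/// Pure mapping from feature bits to a tier (hypothetical helper).
pub fn classify(neon: bool, dotprod: bool, bf16: bool) -> Tier {
    match (neon, dotprod, bf16) {
        (false, _, _) => Tier::Scalar,
        (true, true, true) => Tier::NeonBf16,
        (true, true, false) => Tier::NeonDotProd,
        // bf16 without dotprod conservatively stays on the floor tier.
        (true, false, _) => Tier::Neon,
    }
}

/// Runtime detection on aarch64 via the standard library macro.
#[cfg(target_arch = "aarch64")]
pub fn detect_tier() -> Tier {
    use std::arch::is_aarch64_feature_detected;
    classify(
        is_aarch64_feature_detected!("neon"),
        is_aarch64_feature_detected!("dotprod"),
        is_aarch64_feature_detected!("bf16"),
    )
}

/// Other architectures never select a NEON tier.
#[cfg(not(target_arch = "aarch64"))]
pub fn detect_tier() -> Tier {
    Tier::Scalar
}
```

Keeping the mapping in a pure function makes the tier policy unit-testable on hosts that have no aarch64 hardware.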
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac2a9647de
    #
    # Also works on:
    # - Cortex-X3 / X4 / X925 generic Linux servers
    # - Ampere Altra (V1-class — same baseline)
Remove Altra compatibility claim from Graviton V2 config
This config hard-codes `-Ctarget-cpu=neoverse-v2` plus `+bf16,+i8mm` (line 19), but line 12 says it also works on Ampere Altra. That pairing is unsafe: Altra-class deployments are commonly Neoverse-N1 (ARMv8.2-A) and do not provide the same feature baseline, so binaries built with this profile can execute unsupported instructions and crash with illegal-instruction faults on those hosts.
Summary
Phase 3 of the integration plan in `.claude/knowledge/simd-dispatch-architecture.md`. Scaffold only: no dispatch changes, no behavior change. Lays out the structural skeleton + intrinsic maps + cargo configs that the actual NEON tier implementations will fill in.

aarch64 isn't monolithic, so this PR splits the dispatch surface into three tier files mirroring the x86 v3/v4/native triplet:

`src/simd_neon_baseline.rs`: ARMv8.0-A `+neon`

Floor tier. Pi 3 (A53), Pi 4 (A72), anything without dotprod/fp16/bf16. Documents the silicon and stubs out the future home of the existing `simd_neon.rs::aarch64_simd` 128-bit wrappers (I8x16/I16x8/U8x16/U16x8/U32x4/U64x2/I32x4/I64x2). Also lists the 8 missing 512-bit composed `[neon_native; 4]` wrappers (currently routed through `scalar::*` at `simd.rs:1593`).

`src/simd_neon_dotprod.rs`: ARMv8.2-A `+dotprod,+fp16`

Pi 5 (A76, BCM2712), Cortex-A75+, Apple A11+, Snapdragon 8 Gen 1+. dotprod functions already exist in `simd_neon.rs:191-237` and will migrate here. F16 stubs are new: they document the full `vfmaq_f16` intrinsic map (`vaddq_f16`, `vfmaq_f16`, `vsqrtq_f16`, `vaddvq_f16`, …) with the stable-Rust asm-byte workaround following the `simd_amx.rs` precedent (Rust issue #112800 keeps the intrinsics nightly-only).

`src/simd_neon_bf16.rs`: ARMv8.6-A `+bf16`

Apple M2/M3/M4, Snapdragon X Elite, Cortex-A510+, Graviton 3/4, NVIDIA Grace, Ampere One. Apple M1 explicitly NOT in this tier (M1 is ARMv8.5-A, no BF16). Stubs `BF16x8` (`bfloat16x8_t`) and `BF16x16` (`[bfloat16x8_t; 2]`). Documents `BFMMLA` as the prize intrinsic: 2×2 outer product in one instruction, ~32 GFLOP/s/core on M2 in bf16-matmul-bound kernels. Same asm-byte fallback strategy (Rust issue #117222).

`.cargo/config-{pi5,apple-m2,graviton}.toml`

Cargo configs matching the x86 v3/v4/native triplet shape:

- config-pi5.toml → -Ctarget-cpu=cortex-a76 +dotprod,+fp16
- config-apple-m2.toml → -Ctarget-cpu=apple-m2 +bf16,+dotprod,+fp16,+i8mm
- config-graviton.toml → -Ctarget-cpu=neoverse-v2 +bf16,+dotprod,+fp16,+i8mm

Runtime detection (already wired)

`simd.rs::detect_tier()` already distinguishes `Tier::Neon` vs `Tier::NeonDotProd` via `is_aarch64_feature_detected!("dotprod")` at line 63. The `Tier::NeonBf16` variant + check is a TODO for when the bf16 impls land.

What this is NOT

- `simd.rs`: the new modules aren't wired into `crate::simd::*` yet. They're declared in `lib.rs` as `pub mod simd_neon_*` so they participate in fmt/clippy/check from day one, but nothing reaches into them.
- `F16x16Stub`/`BF16x8Stub`/`BF16x16Stub` placeholder structs are deliberately useless (`unimplemented!()` with pointers to module docs); they exist so consumers can grep for the name and find the implementation roadmap.
- … `#[cfg(target_arch = "aarch64")]` gated).

Why scaffold first

Without aarch64 CI silicon we can't verify byte-encoded asm correctness. Landing the scaffold + docs + intrinsic maps lets:

- … `#[cfg(target_feature = "bf16")]` arm.

Test plan

- `cargo check --target=aarch64-unknown-linux-gnu` from any host: verifies module declarations + stubs compile clean.
- `cargo --config .cargo/config-pi5.toml check --target=aarch64-unknown-linux-gnu` once the dotprod tier implementation lands.

Generated by Claude Code